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Abstract 

A computational system for lattice QCD with exact chiral symmetry is described. 
The platform is a home-made Linux PC cluster, built with off-the-shelf components. At 
present the system constitutes of 64 nodes, with each node consisting of one Pentium 4 
processor (1.6/2.0/2.5 GHz), one Gbyte of PC800/1066 RDRAM, one 40/80/120 Gbyte 
hard disk, and a network card. The computationally intensive parts of our program 
are written in SSE2 codes. The speed of our system is estimated to be 70 Gflops, 
and its price/performance ratio is better than $1.0/Mflops for 64-bit (double precision) 
computations in quenched QCD. We discuss how to optimize its hardware and software 
for computing quark propagators via the overlap Dirac operator. 
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1 Introduction 



Our objective is to extract physics from lattice QCD with possibly minimal amount of com- 
putations. Obviously, the required computing power exceeds that of any desktop personal 
computer currently available in the market. Thus, for one without supercomputer resources, 
building a computational system P seems to be inevitable if one really wishes to pursue a 
meaningful number of any physical quantity from lattice QCD. However, the feasibility of 
such a project depends not only on the funding, but also on the theoretical advancement of 
the subject, namely, the realization of exact chiral symmetry on the lattice [21 E]- Now, if we 
also take into account of the current price/performance of PC hardware components (CPU 
+ RAM + hard disk^), it seems to be the right time to rejuvenate the project P with a new 
goal - to build a computational system for lattice QCD with exact chiral symmetry. In this 
paper, we outline the essential features of a Linux PC cluster (64 nodes) which has been built 
at National Taiwan University. In particular, we discuss how to optimize its hardware and 
software for lattice QCD with overlap Dirac operator. 

First, we start from quenched QCD calculations (i.e., ignoring any internal quark loops 
by setting detD = 1). Thus, our first task is to compute quark propagators in the gluon field 
background, for a sequence of configurations generated stochastically with weight exp(— 
{Ag : pure gluon action). Then the hardronic observables such as meson and baryon corre- 
lation functions can be constructed, and from which the hadron masses and decay constants 
can be extracted. We use the Creutz-Cabbibo-Marinari heat bath algorithm jUE] to generate 
ensembles of SU{3) gauge configurations. 

The computation of quark propagators depends on the scheme of lattice fermions, the hard 
core of lattice QCD. In general, one requires that any quark propagator coupling to physical 
hadrons must be of the form [H] 

{D, + m,y\ (1) 

where niq is the bare quark mass, and is a chirally symmetric and anti-hermitian Dirac 
operator [ -Dc75 + 75-Dc = and {iDcY = iDc ]. Here we assume that Dc is doubler-free, has 
correct continuum behavior, and D = Dc{l +raDc)~^ is exponentially local for smooth gauge 
backgrounds. Note that the way rriq coupling to Dc is the same as that in the continuum. 
The chiral symmetry of Dc (even at finite lattice spacing) is the crucial feature of any quark 
coupling to physical hadrons. Otherwise, one could hardly reproduce the low energy strong 
interaction phenomenology from lattice QCD. 

For any massless lattice Dirac operator D satisfying the Ginsparg- Wilson relation [7j 

D75 + 75^ = 2raDj,D , (2) 

it can be written as [S] 

D = Dcil+raDc)-' , 
and the bare quark mass is naturally added to the Dc in the numerator 0, 

D{mg) = {Dc + m,)(l + raDc)"^ . 

^The emergence of low-price and high-capacity (> 100 Gbyte) IDE hard disk turns out to be also rather 
crucial for this project, since the data storage is enormous. 
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Then the quenched quark propagator becomes 

{D, + m,)-i = (1 - rmga)-^[D{mg)~^ - ra] (3) 

If we fix one of the end points at (0, 0) and use the Hermitcity = 75D75, then only 12 
(3 colors times 4 Dirac indices) columns of 

D{m,)-' = D\m,){D{m,)D\m,)}-' (4) 

are needed for computing the time correlation functions of hadrons. Now our problem is how 
to optimize a PC cluster to compute D{mq)~^ for a set of bare quark masses. 

The outline of this paper is as follows. In Section 2, we briefly review our scheme of 
computing quark propagators via the overlap Dirac operator. The details have been given 
in Ref. jS]. In Section 3, we discuss a simple scheme of memory management for the nested 
conjugate gradient loops. In Section 4, we discuss how to implement the SSE2 codes for the 
computationally intense parts of our program. In Section 5, the performance of our system is 
measured in terms of a number of tests pertaining to the computation of quark propagators. 
In Section 6, we conclude with some remarks and outlooks. 



2 Computational Scheme for quark propagators 

The massless overlap Dirac operator [3] reads as 



D = nioa M I + 75 



(5) 



where Hy^ denotes the Hermitian Wilson-Dirac operator with a negative parameter —mo, 

= 'j^D^ = 75(-mo + 7^t^ + W) , (6) 

7^t^ the naive fermion operator, and W the Wilson term. Then D satisfies the Ginsparg- 
Wilson relation Q with r = l/(2mo). In this paper, we always fix mo = 1.3 for our compu- 
tations. Details of our implementation have been given in Ref. jH] . 
Basically, we need to solve the following linear system 



D{mg)D\mg)Y 



m„ 



2m^ 



m^ 
2 



(75 ± 1) 



Y = 1 



(7) 



by conjugate gradient (CG). Then the quark propagators can be obtained through (jH). With 
Zolotarev optimal rational approximation PIU^JElEl to the multiplication^ 



Y, K 



hwihl + C2n) 



^ ,0 Y = h^{hl + C2n) y] biZi 

1 K + C21-1 



^Note that the Zolotarev optimal rational polynomial in Eq. ^ is in the form r'"'"^ which is different 
from r^""^'"^ used in Ref. We refer to Ref. ^31 for further discussions. 
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can be evaluated by invoking another conjugate gradient process to the hnear systems 



{hl + C2i^i)Zi = Y, l = l,---,n. (9) 

where 



,2(M!_. 



, , nr=/(c2.-c2,-i) 

'^l - "OT^T; 7 ^ 

[li=l,ijLl{C2i~l - C2I-.I, 







2A " l + C2^-i 

l + AfJ 1 + C2Z 

2n+l e^f^-K' 
^ l2n+l''^ 



^ n Q2 ( {2l~l)K' , 

Here denotes the elhptic theta function, and the Jacobian elhptic function sn(u; k') is 
defined by the elhptic integral 

sn dt 



u 

y'(l-t2)(i_K'2t2) 
and K' is the complete elliptic integral of the first kind with modulus /t', 



J(l-t^){l- k'H^ 



where k,' = yl — 1/b, b = ^max/Kniny -^max '^min maximum and the minimum 

of the eigenvalues of H^. 

Instead of solving each Zi individually, one can use multi-shift CG algorithm P^llSj . and 
obtain all Zi altogether, with only a small fraction of the total time what one had computed 
each Zi separately. Evidently, one can also apply multi-shift CG algorithm to ((Tj) to obtain 
several quark propagators with different bare quark masses. 

In order to improve the accuracy of the rational approximation as well as to reduce the 
number of iterations in the inner CG loop, it is crucial to narrow the interval [1, b] by projecting 
out the largest and some low- lying eigenmodes of H^. We use Arnoldi algorithm to project 
these eigenmodes. Denoting these eigenmodes by 

H^Uj = XjUj, j = l,---,k, (10) 

then we project the linear systems Q to the complement of the vector space spanned by 
these eigenmodes 

k 

(hi + C2i.i)Zi = F = (1 - ^ u,u])Y , I = !,■■■, n. (11) 
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In the set of projected eigenvalues of H^, = l,---,k}, we use A^^^, and A^^^ to 

denote the least upper bound and the greatest lower bound of the eigenvalues of H^, where 

k 

= - ^ XjUju] . 
i=i 

Then the eigenvalues of 

l2 _ fT2 /\2 

w wl min 



fall into the interval (1,&), h = Xmax/^min- 

Now the matrix-vector multiplication (jHl) can be expressed in terms of the projected eigen- 
modes (fTnjl plus the solution obtained from the conjugate gradient loop (fTTjl in the comple- 
mentary vector space, i.e., 

1 1 - A 

H^^Y ~ H^hl + C2n) E bi^i + E ^^.-^^ = S (12) 



1=1 i=lv"J 

Then the breaking of exact chiral symmetry Q can be measured in terms of 

\S^S-Y^Y\ 

^ = — yW — • ^^^^ 

In practice, one has no difficulties to attain a < 10~^^ for most gauge configurations on a 
finite lattice [12]. 

Now the computation of quark propagators involves two nested conjugate gradient loops: 
the so-called inner CG loop (fTTj) . and the outer CG loop ((Tj). The inner CG loop is the price 
what one pays for preserving the exact chiral symmetry at finite lattice spacing. 



3 Memory management 

In this section we discuss how to configure the hardware and software of a PC cluster such 
that it can attain the optimal price/performance for the execution of the nested CG loops, 
(HD and (HH). 

First, we examine how much memory is required for computing one of the 12 columns of 
the quark propagators for a set of bare quark masses, since each column can be computed 
independently. If the required memory can be allocated in a single node, then each node can 
be assigned to work on one of the 12 columns of the quark propagators. Then the maximum 
speed of a PC cluster is attained since there is no communication overheads. Nevertheless, the 
memory (RDRAM) is the most expensive component, thus its amount should be minimized 
even though the maximum memory at each node can be up to 4 Gbyte. On the other hand, if 
one distributes the components of the nested CG loops across the nodes and performs parallel 
computations (with MPI) through a fast network switch, then the memory at each node can 
be minimal. However, the cost of a fast network switch and its accessories is rather expensive, 
and also the efficiency of the entire system will be greatly reduced due to the communication 
overheads. Therefore, to optimize the price/performance of the PC cluster relies on what is 
the minimal memory required for computing one of the 12 columns of the quark propagators. 
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Let Ng = X T denote the total number of lattice sites, then each column oi D ^ with 
double complex (16 bytes) entries takes 

iV^ = A^, X 12 X 16 bytes. (14) 

Using Ny or one column as the unit, we list the memory space of all components during the 
execution of the nested CG loops : 

• Gauge links: 3. 

• Number of projected low-lying eigenmodes: k 

• Quark propagators [ i.e., y in ((Tj) ] of A''^ masses: Nm/2. 
(Note that each Y only takes 1/2 column since it is chiral.) 

• Conjugate gradient vectors in the CG algorithm: N„i/2. 

• Residual vector for the outer CG loop: 1/2. 

• The vector Yi (of the smallest bare quark mass) at the interface between the inner and 
the outer CG loops: 1. 

• The inner CG loop: 2n + 3 ( where n is the degree of Zolotarev rational polynomial), 
which consists of 

(i) {Zi} vectors: n; 

(ii) Conjugate gradient vectors {wi}: n; 

(iii) Residual vector (r): 1; 

(iv) \wi): 1; 

(v) if^M: 1. 

Therefore, the memory space for all components of the nested CG loops is 

Ncg = {Nm + 1/2) + {2n + 3) + k + 3 = Nm + 2n + k + 6.5 (columns) (15) 

A schematic diagram of all components of the nested CG loops is sketched in Fig. 1. 

Suppose we wish to compute quark propagators on the 16^ x 32 lattice (at P = 6.0), with 
parameters k = 16, n = 16, and A''^ = 16. Then, according to (fT^ and (fT3j). 

Ny - 0.024 Gbyte , 
Ncg = 70.5 columns , 

the required memory for all components of the nested CG loops is 

Ncg X N„ ~ 70.5 X 0.024 = 1.7 Gbyte 

This seems to imply that one should install'^ four stripes of 512 Mbyte modules (i.e. total 
2 Gbyte) at each node, if one wishes to let each node compute independently, and to attain 
the maximum speed of the PC cluster. However, this is a rather expensive solution at this 

•^At present, most Pentium 4 motherboards designed for housing PC800 RDRAM have 4 memory slots. 
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Figure 1: A schmatic diagram of all memory allocations for the nested CG loops. 

moment, in view of the current price of 512 Mbyte modules. On the other hand, if one 
distributes the components of the nested CG loops across the nodes and performs parallel 
computations (with MPI) through a fast network switch, then the price/performance seems 
to be even worse than the former solution. 

Fortunately, we observe that not all column vectors are used simultaneously at any step of 
the nested CG loops, and also the computationally intense part is at the inner CG loop. Thus 
we can use the hard disk as the virtual memory for the storage of the intermediate solution 
vectors and their conjugate gradient vectors (Yo.,Po-,(j = l,...Nm) at each iteration of the 
outer CG loop, while the CPU is working on the inner CG loop. Then the minimal physical 
memory required at each node can be greatly reduced. Also, the projected eigenmodes are 
not required to be kept inside the memory, since they are only needed at the start of the inner 
CG loop to compute Yi (for the smallest bare quark mass), 

and 

where SpYi is only needed for computing S fll2|) at the completion of the inner CG loop. Thus 
one has the options to keep the vector EpYi inside the memory during the entire inner CG 
loop or save it to the hard disk and then retrieve it at the completion of the inner CG loop. 
Further, since Yi is only needed at the start of the inner CG loop, so it can share the same 
memory location with the residual vector r. 

Now it is clear that the minimum memory at each node (without suffering a substantial 
loss in the performance) is 

AT™'' = (2n + 3) + 3 = 2/2 + 6 (columns) , 

which suffices to accommodate the link variables and all relevant vectors for the inner CG 
loop. After the completion of the inner CG loop and the vector S fjl2j) is computed, the 
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memory space of 2?t, + 3 column vectors is released, and the vectors {Vo-} and {Pa} of the 
outer CG loop can be read from the hard disk, which are then updated to new values according 
to the CG algorithm. 

With this simple scheme of memory management, the minimal memory for computing one 
of the 12 columns of the quark propagators (for a set of bare quark masses) on the 16'^ x 32 
lattice with n = 16 (degree of Zolotarev rational polynomial) becomes 

iV^^^" X iV^ = 38 X 0.024 = 0.912 Gbyte. 

Thus the computation can be performed at a single node with one Gbyte of memory, which 
can be implemented by installing four stripes of 256 Mbyte memory modules, a much more 
economic solution than using 4 x 512 Mbyte modules. Moreover, the time for disk I/O (at the 
interface of inner and outer CG loops) only constitutes a few percent of the total time for the 
execution of the entire nested CG loops (Table 121). This is the optimal memory configuration 
for a PC cluster to compute quark propagators on the 16^ x 32 lattice, which of course is not 
necessarily the optimal one for other lattice sizes. However, our simple scheme of memory 
management for the nested CG loops should be applicable to any lattice sizes, as well as to 
other systems. 

In passing, we emphasize that the Zolotarev optimal rational approximation to (if^)"^/^ 
plays a crucial role to minimize the number of vectors required for the inner CG loop. If one 
had used other rational approximations, then it would require a very large n to preserve exact 
chiral symmetry to a high precision (e.g., a < 10~^^). In that case, it would be impossible to 
attain the optimal price/performance as what has been outlined above. 

4 The SSE2 acceleration 

With the optimal memory allocation for each node, we further enhance the performance of 
our lattice QCD codes (in Fortran) by rewriting its computationally intense parts in the SSE2 
assembly codes of Pentium 4. In this section, we briefly review the basic features of the vector 
unit (SSE2) of Pentium 4, and then describe how to implement SSE2 codes in our lattice QCD 
program. 

4.1 The basic features of SSE2 

The simplest and the most efficient scheme of parallel computation is Single Instruction Mul- 
tiple Data (SIMD). It can be implemented inside CPU through a set of long registers. If each 
register can accommodate several (say, s) data entries, then any operation (addition, subtrac- 
tion, multiplication and division) on these registers will act on all data entries in parallel, thus 
yields the speed-up by a factor of s comparing with normal registers. A schematic diagram 
is shown in Fig. |21 

Even though Intel had implemented the vector unit in their CPUs since Pentium-MMX 
series, only in the most recent IA-32 Pentium 4 and the advanced IA-64 Itanium, the ar- 
chitecture has been extended to SSE2 (Streamed SIMD Extension 2) to incorporate double 
precision data entries. 
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Figure 2: Double precision multiplication performed by the SSE2 instruction in the SIMD 
registers. 

The Pentium 4 processor has eight registers ( %xmmO, %xmml, . . . , %xmm7 ) for SIMD 
operations PHj- Each register is 128 bits wide and can accomodate 4 integers, or 4 single- 
precision or 2 double-precision floating point numbers. Since we always use double precision 
floating point numbers in our program, the execution speed of our program can be almost 
doubled if SSE2 is turned on judiciously in the computationally intensive parts. Note that 
SSE2 complies with the IEEE 32-bit and 64-bit arithmetic, thus the precision is lower than 
the extended 80-bit precision of the normal registers in Pentium 4. However, the difference is 
less than one part in 10^^ (double precision), thus is negligible in our computations. 

4.2 How to implement SSE2 codes in Fortran programs 

Since our lattice QCD codes were originally written in Fortran 77, it would be natural if 
SSE2 codes can be directly embedded in our Fortran program. However, to our knowledge, 
the Fortran compilers currently available in the market do not support the option of inlining 
SSE2 codes. Moreover, for optimal performance of SSE2, the data should be aligned to 
16-byte memory boundary. This can be easily carried out in C. Therefore our strategy to 
implement SSE2 codes is rewrite the main program unit in C such that the data arrays are 
allocated and aligned to 16 bytes memory boundary, then the SSE2 codes are embedded in 
C subroutines which are then called by original routines in Fortran. 

Of course, if one has written lattice QCD codes in C, then the SSE2 codes can be embedded 
in C routines directly, without dealing with the interface of C and Fortran. 

In the following, we illustrate our scheme of implementing SSE2 codes with an example 
program. The default compilers are gcc and g77 in Linux. 

program main 

implicit none 

integer n 

parameter (n=100) 

double precision r(n) , v(n) 

call vxzero(n, r, v) 

end 

subroutine vxzero(n, r, v) 
implicit none 
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integer n 

double precision c, r(*), v(*) 

call vadd(n, c, r, v) 

end 



r = r + c V 
c: scalar 
r, v: vector 



Here the Fortran main program calls the subroutine vxzero which in turn calls a computa- 
tionally intensive routine vadd. 

First, we rewrite the main program in C, with the data arrays allocated and properly 
aligned. 

#include <malloc.h> 

int main(int argc, char **argv) 

{ 

int n=100; 
double *r, *v; 
/* setup the environment for Fortran */ 
f _setarg(argc , argv) ; 
f _setsig() ; 
f.initO ; 

/* allocate r & v, and align them to 16-byte boundary */ 

r = memalign(16, n*sizeof (double) ) ; 

V = memalign(16, n*sizeof (double) ) ; 
/* call the Fortran subroutine */ 

vxzero_(&n, r, v) ; 
/* shutdown the I/O channels of Fortran */ 

f _exit ; 

exit(O) ; 

return 0; 

} 



The function call memalign() dynamically allocates 16 bytes aligned pointers r and v. Then 
the aligned arrays v[] and r[] can be passed to C subroutines for SSE2 operations. 

Next we rewrite the computationally intensive routine vadd in C with embedded SSE2 
codes. 

/* load variable a into yo°/oxminO */ 
#define sse_load(a) \ 

__asm volatile.. ( "movapd °/oO, %7oxmmO" :: "m" (a)) 



/* r = r + yo%xmmO x v */ 

#define sse.add(r, v) \ 

asm volatile ( \ 

"movapd 7.1, "/J.xmml \n\t" \ 



9 



"movapd 7.2, %y.xinm2 \n\t" \ 

"mulpd %y.xminO, nxmm2 \n\t" \ 

"addpd %y.xminl, %yoXmm2 \n\t" \ 

"movapd y.y.xnun2, y.O" \ 

: \ 
/* store to address (r) , which is indexed as */ \ 

"=m" (r) \ 

: \ 
/* load from address (r) and (v) , which are \ 
indexed as y„l and y„2, respectively */ \ 

"m" (r), "m" (v)) 

#define ALIGN16 __attribute__ ((aligned (16))) 

void vadd_(int *n, double *coeff , double *r, double *v) 
{ 

int i, len; 

static double cc [2] ALIGN16; 

/* the array cc is aligned to 16-byte boundary */ 

cc[0] = cc[l] = *coeff; 
sse_load(cc [0] ) ; 
len = (*n)/2; 

for (i=0; i<len*2; i+=2) { 
sse_add(r[i] , v[i]); 

} 

if (*n % 2 != 0) 

r[len*2] = r[len*2] + cc [0] * v[len*2] ; 

} 

Note that we have added the keyword " __volatile__" ( an GNU extension ) in the macro 
" __asm__" . Its purpose is to ensure that the compiler does not rearrange the order of execution 
of the codes during compilation. Finally, all object modules are linked by gcc with the option 
"-lg2c". 

4.3 The implementation of H^j times \v) 

In our lattice QCD program, most of the execution time is spent in solving quark propagators 
via the nested CG loops. Thus the execution time is dominated by the operation times 
which is performed many times ( > 10^ in most cases ) before the final results of quark 
propagators can be obtained. Thus it is crucial to optimize this operation with SSE2 codes. 

First, we have to set up the correspondence between the data structures used by C and 
Fortran routines in our program, in particular, for the link variables and the relevant vectors 
in the nested CG loops. 
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Suppose we write the arrays of link variables and a column vector v in the syntax of 
Fortran as 

u{i,j,IJ,,x) , 
v{i, k, x) , 

where i and j are the color indices, is the space-time direction, k is the spinor index, and x 
is the site index. Now the question is how to access the elements of these arrays in C routines. 
To resolve this problem, we define some data structures in C as follows. 

/* SU(3) matrix, (c01,c02) forms the complex number of ull, and 

(c03,c04) of u21, etc. */ 

typedef struct { 

double cOl, c02, c03, c04, c05, c06; 

double c07, c08, c09, clO, cll, cl2; 

double cl3, cl4, cl5, cl6, cl7, clB; 
} su3_t ; 



/* there are 4 link variables at each site. */ 
typedef struct { 

su3_t mul, mu2, mu3, mu4; 
} ulink_t; 

/* SU(3) vector, (cl,c2) forms the complex number of vl, 
(c3,c4) of v2, and (c5,c6) of v3. */ 

typedef struct { 

double cl, c2, c3, c4, c5, c6; 
} vector_t; 



/* SU(3) Dirac spinor. */ 
typedef struct { 

vector_t si, s2, s3, s4; 
} spinor_t; 

Then the correspondence can be easily established. For example, the elements m(3, 2, 1, x) 
and t>(2,4,x) can be accessed by C routines as (-ufxj.mul.cll, M[x].mul.cl2) and ( f [x].s4.c3, 
u[x].s4.c4 ) respectively. 

Now we rewrite as 

Hw{x,y) = 75 |(4- mo)5x,y + ^ ^ [(-1 + 7M)f^M(^)'^^+M,!y " (1 + Tm)^^^!^ - /^)'5x-m,2/]| 

Then the multiplication of to a column vector \v) can be optimized by minimizing 
the number of multiplications involving the link variables. For example, the multiplication in 
(— 1 + 7i)m |w) can be written as ( in the spinor space ) 



(-1 + 7i)m \v) = 
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where all indices are suppressed except the spinor indices. It is clear that the vectors f 4 — f 1 
and f3 — V2 should be computed first, before they are multiplied by link variable u ( generic 
symbol for ). For example, the operation W4 — vi can be performed by the following 

macros with SSE2. 

#define mvpv(vl, v2) \ 

asm volatile ( \ 

'movapd 7.0, TUma.0 \n\t" \ 
'movapd 7.1, °/o7.xmml \n\t" \ 
'movapd 7.2, 7o7.xmm2 \n\t" \ 
'subpd 7.3, 7.7oxmmO \n\t" \ 
'subpd 7.4, 7.7oxmml \n\t" \ 
'subpd 7.5, 7.7.xmm2" \ 
: : \ 

'm" ((v2).cl), \ 
((v2).c3), \ 
((v2).c5), \ 
((vl).cl), \ 
((vl).c3), \ 
((vl).c5)) 



'm" 
'm" 
'm" 
'm" 
'm" 



Similarly, we have 



;-I -7i)m^ \v) 



(-1 + 72)n \v) 



(-1 - 72)^^ \v) 



-I + 73)n \v) 



(-1 - 73)n^ \v) 



r2 _ -^''"(^3 + V2) 
rs r2 

V r4 / V n J 
f n \ / -u{vi + W4) \ 

r2 _ -irs 

rs ~ -uivs + iv2) 
\r4 J \ -iri J 

f ri \ / -ir^ \ 

r2 _ -u\v2 + zwa) 
rs -ir2 

V r4 / V -v){va + ivi) ) 

( ri\ I u{vs - vi) ^ 

r2 _ -u{v2 + Va) 
rs -ri 

V ^"4 / V r2 j 
( r\\ I -v){vi + vs) \ 

r2 _ u\vi-V2) 

rs ~ ri 
\ri / \ -r2 ) 
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{-1 + J4:)U\V) = 



/ n\ 


( 




r2 




— zr4 


r3 




-u(i;3 + ivi) 
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-u{v4 + iV2) / 



r3 —iri 

) \ -ir2 J 



So the multiplications involving the link variables can be implemented as 

/* for each lattice size */ 
for (x=0; x<ldiin; x++) { 

/* prefetch for the current multiplication */ 

y = iup [x] .mul-1 ; 

_pref etch_su3(&(u[x] .mul)) ; 

_pref etch_spinor (& (v [y] ) ) ; 
/* prefetch for the next multiplication */ 

z = idn [x] .mul-1 ; 

_pref etch_su3(&(u[z] .mul)) ; 

_pref etch_spinor (& (v [z] ) ) ; 
/* rl.sl = u[x] .mul * (v[y].s4 - v[y].sl) */ 

mvpv(v [y] . si , v[y].s4); 

su3mul(rl . si , u[x].mul); 
/* rl.s2 = u[x] .mul * (v[y].s3 - v[y].s2) */ 

mvpv(v[y] . s2, v[y].s3); 

su3mul(rl . s2, u[x].mul); 
/* rl.s3 = -rl.s2 */ 

mvset(rl.s3, rl.s2); 
/* rl.s4 = -rl.sl */ 

mvset(rl.s4, rl.sl); 



/* prefetch for the next multiplication */ 

y = iup [x] .mu2-l ; 

_pref etch_su3(&(u[x] .mu2)) ; 

_pref etch_spinor(&(v[y] )) ; 
/* r2.sl = -(u[x] .mul) "{\dagger} * (v[y].sl + v[y].s4) */ 

mvmv(v [z] . si , v[z].s4); 

su3Hmul (r2 . si , u[z].mul); 
/* r2.s2 = -(u[x] .mul)~{\dagger} * (v[y].s2 + v[y].s3) */ 

mvmv(v[z] .s2, v[z].s3); 

su3Hmul (r2 . s2 , u[z].mul); 
/* r2.s3 = r2.s2 */ 

pvset(r2.s3, r2.s2); 
/* r2.s4 = r2.sl */ 

pvset(r2.s4, r2.sl); 
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where rl, r2, . . . , and v[] are declared as the type spinor_t, and u[] is declared as the type 
ulink_t. Note that prefetching has been inserted in order to attain the optimal performance. 
Finally, we have 8 vector segments rl, . . . , r8, and a diagonal term. They are summed over 
to give the final result of v[y\, 



v[y].sl 


= rl.sl + r2.sl + rS.sl - 

+ (4 — mo) * , 


- r4.sl + rS.sl + r6.sl + rl.sl + 


,.sl 




v[y].s2 


= rl.s2 + r2.s2 + r3.s2H 
+ (4 — mo) * f [x].s2 , 


- r4.s2 + r5.s2 + r6.s2 + r7.s2 + 


,.s2 




v[y].sZ 


= — (rl.s3 + r2.s3 + r3.s3 + r4.s3 + r5.s3 + r6.s3 + r7.s3 -|- 


r8 


s3 




+ (4 — mo) * t4x].s3) , 








v[y].sA 


= -(rl.s4 + r2.s4 + r3.s4 + r4.s4 + r5.s4 + r6.s4 + r7.s4 + 


rS 


s4 




+ (4 — mo) * t^^]-'^^) , 









Next we come to the question how to implement SSE2 codes for a SU{3) matrix times 
a vector, the most crucial part in Hy^ times \v). This problem has been solved by Liischer 
|19j . and his SSE2 codes is available in the public domain [20]. We found that Liischer's code 
is quite efficient, and have adopted it in our program. For completeness, we briefly outline 
Liischer's algorithm as follows. 

Consider 

^^11 ^^13 \ [ ( \ 

U21 U22 ^i23 X 2/2 = ^2 
^^31 U32 U33 J \ 2/3 / \ ^3 / 

First, the elements {yi, 2/2, 2/3) of the vector \y) are copied to the registers %xmmO, %xmml, 
and %xmm2, respectively. Then the real part of the SU(3) matrix {umn} is read sequentially, 
and is multiplied to \y) at %xmmO, %xmml, and %xmm2, and the result is stored at %xmm3, 
%xmm4, and %xmm5, 

%xmmO = (Re(?/i), Im(yi)) , 
%xmml = (Re(?/2), Im(?/2)) , 
%xmm2 = (Re(?/3),Im(y3)) , 
%xmm3 = (^1,^2), 
%xmm4 = (^3,^4), 
%xmm5 = (^5,^6) 5 

where 





= Re(Mii)Re(?/i) 


+ Re(Mi2)Re(y2) 


+ Re(Mi3)Re(|/3) , 


t2 


= Re(uii)Im(?/i) 


+ Re(ni2)Im(?/2) 


+ Re(ni3)Im(?/3) , 


h 


= Re(M2i)Re(?/i) 


+ Re(M22)Re(?/2) 


+ Re(M23)Re(2/3) , 




= Re(M2i)Im(?/i) 


+ Re(M22)Im(?/2) 


+ Re(M23)Im(?/3) , 


h 


= Re(M3i)Re(yi) 


+ Re(M32)Re(2/2) + Re(M33)Re(|/3) , 




= Re(M3i)Im(?/i) 


+ Re(M32)Im(?/2) 


+ Re(M33)Im(?/3) . 
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Lattice Size 


SSE2 off 


SSE2 on 


speed-up 


8^ X 24 


0.034 


0.018 


1.89 


10^ X 24 


0.065 


0.036 


1.81 


12^ X 24 


0.110 


0.063 


1.75 


16^ X 32 


0.328 


0.183 


1.79 



Table 1: The execution time (in unit of second) for multiplying a column vector F, with 
SSE2 turned on and off. The test is performed at a Pentium 4 (2 GHz) node. 



Next, multiply the vector y hj i = (0, 1), i.e., 

%xmmO (Im(?/i),Re(yi)) (-Im(?/i), Re(i/i)) , 

%xmml (Im(|/2),Re(?/2)) ^ (-Im(|/2), Re(i/2)) , 
%xmm2 (Im(?/3),Re(?/3)) ^ (-Im(?/3), Re(i/3)) , 

which is implemented by the following SSE2 code 

static int sn3 [4] ALIGN16 = {0x0,0x80000000,0x0,0x0}; 
#define su3mul(r, u) \ 

\ 

"xorpd 7o9, llxmmO \n\t" \ 
"xorpd 7o9, 7o°/oXininl \n\t" \ 
"xorpd 7o9, 7.7oXinm2 \n\t" \ 

\ 

:: \ 

\ 

"m" (sn3[0])); 

Then the imaginary part of {umn} is read and multiplied to iy, and the final result is 

%xmm3 = {ti + Si, ^2 + ^2) , 

%xmm4 = (^3 + S3, ^4 + S4) , 

%xmm5 = (ts + S5, + Sq) , 

where 





= -Im(Mii)Im(?/i) 


- Im(Mi2)Im(?/2) 


- Im(Mi3)Im(|/3 


S2 


= +Im(uii)Re(i/i) 


+ Im(ui2)Re(y2) 


+ Im('Ui3)Re(2/3 


S3 


= -Im(M2i)Im(|/i) 


- Im(u22)Im(?/2) 


- Im(M23)Im(|/3 


S4 


= +Im(M2i)Re(|/i) 


+ Im(M22)Re(?/2) 


+ Im(M23)Re(2/3 


S5 


= -Im(u3i)Im(|/i) 


- Im(u32)Im(?/2) 


- Im(M33)Im(|/3 


S6 


= +Im(u3i)Re(yi) 


+ Im(u32)Re(y2) 


+ Im(u33)Re(?/3 
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Arnoldi vectors 


Iterations 


Time 


40 


756 


12923 


50 


160 


4757 


60 


108 


4730 


70 


82 


4414 


80 


65 


4131 


on 
yu 


oo 


ZLI (T? 


100 


46 


4100 


120 


37 


4251 


140 


32 


4869 


160 


26 


5002 



Table 2: The execution time (in unit of second) for projecting 20 low-lying eigenmodes of 
using ARPACK, versus the number of Arnoldi vectors. The test is performed at a Pentium 4 
(1.6 GHz) node, for a gauge configuration on the 8^ x 24 lattice, at /5 = 5.8. Each eigenmode 
satisfies \\{H^ - X^)\x)\\ < lO'^^. 



5 Performance of the system 

In this section, we measure the performance of our system by a number of tests pertaining to 
the computation of quark propagators. 

In Table ^ we list the execution time (in unit of second) for multiplying a column 
vector Y, for both cases with SSE2 turned on and off, and for several lattice sizes. The data 
shows that turning on SSE2 can speed up our program by a factor ~ 1.8. 

In Table |2l we list the execution time (in unit of second) for projecting 20 low-lying 
eigenmodes of using ARPACK, versus the number of Arnoldi vectors. It is clear that 
there exists an optimal number of Arnoldi vectors for a projection, which of course depends 
on the gauge configuration. In Table|21 the optimal number is ~ 100, which amounts to ~ 240 
Mbyte for the 8^ x 24 lattice. However, for larger lattices such as 16^ x 32, the optimal number 
may require more than one gigabyte of memory. In this case, the projection of eigenmodes is 
carried out at some nodes with 2 gigabyte of memory. 

In Table El we measure the time used by disk I/O in our simple scheme of memory 
management for the nested CG loops, versus the number of bare quark masses. The test 
is performed at a Pentium 4 (2 GHz) node, for the 16^ x 32 lattice, and with 16 projected 
eigenmodes. The disk I/O time is the difference of the total execution time between two 
cases of turning on and off of the memory management. It is remarkable that the percentage 
of disk I/O time is only 3% of the total execution time even for 16 bare quark masses, 
and with 16 projected eigenmodes. Evidently, for the 16^ x 32 lattice, our simple scheme of 
memory management is more efficient and less expensive than any other options, e.g., parallel 
computing (with MPI) through a fast network switch. 

In Table m we list the execution time (in unit of second) for a Pentium 4 (2 GHz) node to 
compute 12 columns of quark propagators in a topologically nontrivial gauge background at 
13 = 5.8 on the 8^ x 24 lattice, versus the number of projected low-lying eigenmodes. Other 
parameters for the test are: the degree of Zolotarev rational polynomial is n = 16; the number 



16 



1 8 16 



CG time 491.9 

disk I/O time 7.6 

Total time 499.5 

disk I/O (%) 1.5% 



494.8 

8.9 
503.7 
1.8% 



497.1 
14.3 
511.4 
2.9% 



Table 3: The percentage of time spent in memory management (disk I/O) versus the number 
of bare quark masses (Nm)- The test is performed at a Pentium 4 (2 GHz) node, for the 
16^ X 32 lattice, and with 16 projected eigenmodes. The time (in unit of second) shown here 
is only for completing one outer CG iteration for one column of D~^. 



of bare quark masses is = 12; each projected eigenmode satisfies \\{H^ — A^)|x)|| < 10~^^, 
and the stopping criterion for inner and outer CG loops is e = 10~^^. The execution time 
is decomposed into three parts : (i) the projections of high and low-lying eigenmodes^; (ii) 
computing 12 columns of {DD'^)^^ via the nested CG loops; and (iii) computing and 
multiplying it to (Z)Z)^)~^. The total time is listed in the last column of the table. For 
completeness, we also list Xmax and Xmin of (after the projections), the total numbers 
of iterations of the outer CG loop and average iterations of the inner CG loop, as well as 
the precision of exact chiral symmetry in terms of cr ([T^ . Evidently, the time for projecting 
out the high and low-lying eigenmodes is only a very small fraction of the total execution 
time for computing 12 columns of quark propagators. However, the projections have very 
significant impacts on the total execution time since it yields the speed-up by a factor of 2.44, 
as comparing the first row (no projections) with the last row (projections of 40 low-lying 
eigenmodes). Moreover, with projections, the exact chiral symmetry can be easily preserved 
to a very high precision (a < 10"^'^). This suggests that one should project as many low-lying 
eigenmodes as possible, before executing the nested CG loops. In general, we suspect that the 
optimal number of projections depends on the projection algorithm, the amount of memory 
of the system, as well as the gauge configuration. 

In Table El we measure the precision of exact chiral symmetry a |T!?|l versus the degree 
(n) of Zolotarev optimal rational polynomial. The values of a listed in the second column of 
Table El are the maxima in the nested CG loops. The execution time and the iterations of the 
nested CG loops are also listed. Evidently, the precision of exact chiral symmetry a is quite 
different from the stopping criterion e = 10~^^ for inner and outer CG loops, since a can be 
much bigger or smaller than e, as shown in Table O as well as in Tables |31 and El It is clear 
that the necessary condition for preserving exact chiral symmetry to a very high precision is 
to use a higher degree (n) Zolotarev rational polynomial for (if^)~^/^. In Ref. ^21; tables 
are provided for looking up which degree n is required to attain one's desired accuracy in 
preserving the exact chiral symmetry on the lattice, versus the parameter b = X'^ax/^min of a 
given gauge configuration. 

In Table El we list the execution time (in unit of second) of a Pentium 4 (2 GHz) node to 
compute 12 columns of quark propagators, versus the size of the lattice. The parameters for 
the test are: the degree of Zolotarev rational polynomial is n = 16, the number of bare quark 

"^Note that the projection time hsted in the 2nd column of Table 0] includes 167 seconds for projecting 4 
highest eigenmodes of . 
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projections 






inner CG 


outer CG 


X sym. 




CG 


mult. 


Total 


# time 


Amin 


A max 


ave. iters. 


tot. iters. 


cr(max. 


) 


time 


time 


time 





0.017 


6.207 


965 


1282 


4.5 X 10-- 


10 


137221 


15070 


152291 


8 1573 


0.138 


6.207 


552 


1282 


5.2 X 10- 


14 


70908 


7828 


80309 


16 2753 


0.165 


6.207 


475 


1282 


5.7 X 10" 


14 


61543 


6803 


71099 


24 3703 


0.178 


6.207 


443 


1282 


4.3 X 10- 


14 


57792 


6374 


67869 


32 4725 


0.198 


6.207 


403 


1282 


5.3 X 10- 


14 


52961 


5864 


63550 


40 6524 


0.211 


6.207 


378 


1282 


6.0 X 10- 


14 


50301 


5581 


62406 



Table 4: The execution time for a Pentium 4 (2 GHz) node to compute 12 columns of quark 
propagators, versus the number of projected low-lying eigenmodes. The parameters for the 
test are : the lattice size is 8^ x 24; (3 = 5.8; the degree of Zolotarev rational polynomial 
is n = 16; the number of bare quark masses is = 12 and ma > 0.06; each projected 
eigenmode satisfies \\{H^ — A^)|a;)|| < 10^^^; and the stopping criterion for inner and outer 
CG loops is e = 10"^^ 



Zolo. 


X sym. 




CG iters. 


CG 


muh. 


Total 


degree 


cr(max. 


) 


inner 


outer 


time 


time 


time 


4 


1.7 X 10- 


-4 


367 


286 


82393 


4493 


86885 


8 


1.5 X 10^ 


~8 


402 


288 


111247 


6003 


117250 


10 


2.9 X 10- 


10 


408 


288 


120940 


6520 


127460 


12 


6.4 X 10- 


12 


411 


288 


129188 


6962 


136150 


16 


1.4 X 10- 


13 


414 


288 


148654 


8006 


156660 



Table 5: The precision of exact chiral symmetry a versus the degree (n) of the Zolotarev 
rational polynomial. The test is performed at a Pentium 4 (2 GHz) node, with the parameters: 
lattice size=16^ x 32; /3 = 6.0; the number of bare quark masses is Nm = 16; the number of 
projected eigenmodes is k = 20; each projected eigenmode satisfies \\{H^ — A^)|a;)|| < 10~^^; 
^ — Knax/Knin — 1086; and the stopping criterion for the CG loops is e = 10~^^. 
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masses is A^^ = 16 and ma > 0.02, each projected eigenmode satisfies — A^)|x)|| < 10^^^, 
and the stopping criterion for the inner and the outer CG loops is e = 10~^^. From the 
last entry of the last row, we can estimate that a Pentium 4 (2 GHz) node takes about 24 
days to complete 12 columns of quark propagators (for 16 bare quark masses) for one gauge 
configuration at /5 = 6.0 on the 16'^ x 32 lattice. In other words, if we have 12 nodes and let 
each one of them work on one column of D~^, then we can complete the quark propagators 
for one gauge configuration in two days. Since our system consists of 64 nodes, so we can 
compute quark propagators at a rate more than five gauge configurations per two days. 

6 Conclusions 

In this paper, we outline the essential features of a Linux PC cluster (64 nodes) which has been 
built at National Taiwan University, and discuss how to optimize its hardware and software 
for lattice QCD with exact chiral symmetry. At present, all nodes are working around the 
clock on lattice QCD computations. 

With Zolotarev optimal rational approximation to (if^)~^/^, projections of high and low- 
lying eigenmodes of H^, the multi-mass CG algorithm, the SSE2 acceleration, and our simple 
scheme of memory management, we are able to compute quark propagators of 16 bare quark 
masses on the 16^ x 32 lattice, with the precision of quark propagators up to 10~^^ and the 
precision of exact chiral symmetry up to 10~^^, at the rate of 2.5 gauge configuration {j3 = 6.0) 
per day, with our present system of 64 nodes. This demonstrates that an optimized Linux 
PC cluster can be a viable computational system to extract physical quantities from lattice 
QCD with exact chiral symmetry [HI 1^ 122] • 

The speed of our system is higher than 70 Gflops, and the total cost of the hardware is 
less than US$60000. This amounts to price/performance ratio better than $1.0/Mflops for 
64-bit (double precision) computations. The basic idea of optimization is to let each node 
work independently on one of the 12 columns of the quark propagators (for a set of bare 
quark masses), and also use the hard disk as the virtual memory for the vectors in the outer 
CG loop, while the CPU is working on the inner CG loop. Our simple scheme of memory 
management for the nested CG loops may also be useful to other systems. 

In future, we will add more nodes to our system, and will also work on larger lattices, say 
24'^ X 48. Then one Gbyte memory at each node is not sufficient to accommodate all relevant 
vectors in the inner CG loop, even for 12 Zolotarev terms. However, there are several ways 
to circumvent this problem. First, our memory management scheme is quite versatile, which 
is more than just for swapping the vectors at the interface of inner and outer CG loops. In 
fact, it can handle any number of Zolotarev terms for any lattice size, and can automatically 
minimize disk I/O at any step of the nested CG loops, according to the amount of physical 
memory of a node. As long as the percentage of the disk I/O time is less than 30%, it is 
still a better option than distributing the nested CG loops across the nodes and performing 
parallel computations (with MPI) through a fast network switch, since the communication 
overheads is expected to be more than 30% of the total time, especially for a system of 100 
nodes or more. Secondly, we can increase the amount of memory at each node, which depends 
on the specification of the motherboard as well as the price and the capacity of the memory 
modules. Finally, we can also exploit algorithms j2S| which only use five vectors rather than 
2n-|- 3 vectors for the inner CG loop, or the Lanczos algorithm as described in Ref. j21I. Now 



19 







projections 






X sym 




inner CG 


outer CG 


CG 


mult. 


disk I/O 


Total 


Lattice 




# 


time 




^max 


(T(max. 


) 


ave. iters. 


tot. iters. 


time 


time 


time 


time 


8^ X 24 


5.8 


32 


4725 


0.198 


6.207 


5.4 X 10^ 


14 


403 


1322 


54384 


7804 





66913 


10^ X 24 


5.8 


30 


7803 


0.152 


6.204 


6.4 X 10" 


14 


519 


1943 


191861 


18626 





218290 


12^ X 24 


5.8 


30 


13258 


0.129 


6.211 


9.8 X 10" 


14 


608 


2840 


574226 


38234 





625718 


16^ X 32 


6.0 


20 


74937 


0.215 


6.260 


3.3 X 10- 


13 


370 


3968 


1866172 


87890 


66976 


2095975 



Table 6: The execution time (in unit of second) of a Pentium 4 (2 GHz) node to compute 12 columns of quark propagators, 
versus the size of the lattice. The parameters for the test are: the degree of Zolotarev rational polynomial is n = 16, the number 
of bare quark masses is N^, = 16 and ma > 0.02, the precision of each projected eigenmode satisfies — A^)|x)|| < 10"^^, and 
the stopping criterion for inner and outer CG loops is e = 10~^^. 



it is clear that a Linux PC cluster is a viable platform to tackle lattice QCD with exact chiral 
symmetry even for a large lattice ( e.g., 32'^ x 64), though more studies are needed before one 
reaches an optimal design for dynamical quarks. 

Note added in proof: 

Recently, it has been shown j2S] that the speed of Neuberger's double pass algorithm [SB] 
for computing the matrix- vector product R^^~^'"'\H^) -Y is almost independent of the degree 
n of the rational polynomial, and it is faster than the single pass for n > 13 (for Pentium 4 
with SSE2). Thus the single pass has been replaced with the double pass algorithm in our 
Linux PC cluster. 
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